This assignment focuses on the Kaggle competition Forest Cover Type Prediction: https://www.kaggle.com/c/forest-cover-type-prediction/overview. The task is a classification problem: predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data).
The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:
The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features.
You must predict the Cover_Type for every row in the test set (565892 observations).
I expect 3 files from each group:
No group presentation is expected for this assignment.
%matplotlib inline
Our project addresses a multi-class classification problem: we need to predict the cover type of every observation in the test set. Our goal is to find the algorithm that predicts the correct cover types with the highest accuracy we can achieve.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the training data and work on a copy so the raw frame stays untouched
dataframe = pd.read_csv('train.csv')
df = dataframe.copy()
df
| | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 3 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 4 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 5 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15115 | 15116 | 2607 | 243 | 23 | 258 | 7 | 660 | 170 | 251 | 214 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 15116 | 15117 | 2603 | 121 | 19 | 633 | 195 | 618 | 249 | 221 | 91 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 15117 | 15118 | 2492 | 134 | 25 | 365 | 117 | 335 | 250 | 220 | 83 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 15118 | 15119 | 2487 | 167 | 28 | 218 | 101 | 242 | 229 | 237 | 119 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 15119 | 15120 | 2475 | 197 | 34 | 319 | 78 | 270 | 189 | 244 | 164 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
15120 rows × 56 columns
Our training set consists of 15120 observations from the four wilderness areas in the Roosevelt National Forest in northern Colorado. Each observation is a 30m x 30m patch labelled with one of seven cover types; this label, Cover_Type, is our target variable.
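Since Cover_Type is the target, a natural first step is to separate it from the features and look at how the classes are distributed. The snippet below is a minimal sketch of that step, not part of the original notebook: the helper `split_features_target` is a hypothetical name, and the small `toy` frame only reuses a few columns from the first four rows displayed above so that the example runs without `train.csv`. We also assume that `Id` is a row identifier and should not be used as a feature.

```python
import pandas as pd


def split_features_target(frame, target="Cover_Type", drop=("Id",)):
    """Return (X, y): the feature columns and the target column.

    Hypothetical helper for illustration; `drop` lists identifier
    columns that are assumed to carry no predictive information.
    """
    y = frame[target]
    to_drop = [target] + [col for col in drop if col in frame.columns]
    X = frame.drop(columns=to_drop)
    return X, y


# Toy stand-in for train.csv: a few columns of the first four rows shown above.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Elevation": [2596, 2590, 2804, 2785],
    "Slope": [3, 2, 9, 18],
    "Cover_Type": [5, 5, 2, 2],
})

X, y = split_features_target(toy)

# Number of observations per cover type, ordered by class label.
class_counts = y.value_counts().sort_index()
print(X.columns.tolist())
print(class_counts)
```

On the real data the same call would be `split_features_target(df)`, and `class_counts` would show whether the seven cover types are balanced in the training set.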
pandas-profiling is a great tool for getting an initial overview of the dataset, as it provides many different insights (per-column statistics, missing values, correlations) in just a couple of lines of code.
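# Note: newer releases of this package are published as ydata_profiling
from pandas_profiling import ProfileReport

# minimal=False keeps the full report, which is slower on wide datasets
report = ProfileReport(df, minimal=False)
report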
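If the profiling package is not installed, or the full report is too slow to render, a rough overview can be put together with plain pandas. The following is only a sketch of a lightweight substitute and not the report used in this notebook: `quick_overview` is a hypothetical helper, and the `toy` frame again reuses a few columns from the first four rows of the table above.

```python
import pandas as pd


def quick_overview(frame):
    """Per-column dtype, missing-value count, unique-value count, min and max.

    Hypothetical helper for illustration; min/max are computed for
    numeric columns only.
    """
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
        "min": frame.min(numeric_only=True),
        "max": frame.max(numeric_only=True),
    })


# Toy stand-in for train.csv: a few columns of the first four rows shown above.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Elevation": [2596, 2590, 2804, 2785],
    "Slope": [3, 2, 9, 18],
    "Cover_Type": [5, 5, 2, 2],
})

overview = quick_overview(toy)
print(overview)
```

Calling `quick_overview(df)` on the training data would give one row per column, which is enough to spot missing values and constant columns before opening the full report.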